Faster K-Means Cluster Estimation

Authors

  • Siddhesh Khandelwal
  • Amit Awekar
Abstract

K-means is a widely used iterative clustering algorithm. There has been considerable work on improving k-means in terms of both mean squared error (MSE) and speed. However, most k-means variants compute the distance from each data point to every cluster centroid in every iteration. We propose two heuristics to overcome this bottleneck and speed up k-means. Our first heuristic predicts the candidate clusters for each data point by looking at nearby clusters after the first iteration of k-means. Our second heuristic further reduces this candidate cluster list aggressively. We augment well-known variants of k-means with our heuristics to demonstrate their effectiveness. On various synthetic and real-world datasets, our heuristics achieve a speed-up of up to 10 times without a significant increase in MSE.
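The abstract does not spell out how candidate clusters are chosen, so the following is only a minimal sketch of one plausible reading of the first heuristic, not the authors' implementation. It assumes that, after one full k-means iteration, each point's candidate list is its assigned cluster plus that cluster's nearest neighbouring centroids, and that the list is then frozen. The names `candidate_kmeans`, `n_candidates`, `sq_dist`, and `update_centroids` are hypothetical. The second heuristic is not sketched because the abstract gives no detail about it.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def update_centroids(points, labels, centroids):
    """Recompute each centroid as the mean of its points (empty clusters keep theirs)."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p, c in zip(points, labels):
        counts[c] += 1
        for d in range(dim):
            sums[c][d] += p[d]
    return [
        [s / counts[c] for s in sums[c]] if counts[c] else list(centroids[c])
        for c in range(k)
    ]


def candidate_kmeans(points, k, n_candidates, n_iter=20, seed=0):
    """K-means that restricts later iterations to a per-point candidate list.

    Assumption (not from the paper): candidates are the point's cluster after
    the first iteration plus the n_candidates - 1 clusters nearest to it.
    """
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]

    # Iteration 1: ordinary full assignment against all k centroids.
    labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
    centroids = update_centroids(points, labels, centroids)

    # For every cluster, list itself first and then its nearest centroids.
    neighbours = [
        sorted(range(k), key=lambda j: (j != c, sq_dist(centroids[c], centroids[j])))[
            :n_candidates
        ]
        for c in range(k)
    ]
    # Each point inherits the candidate list of its first-iteration cluster.
    candidates = [neighbours[c] for c in labels]

    # Later iterations: compare each point only against its candidates.
    for _ in range(n_iter - 1):
        new_labels = [
            min(candidates[i], key=lambda c: sq_dist(p, centroids[c]))
            for i, p in enumerate(points)
        ]
        if new_labels == labels:
            break
        labels = new_labels
        centroids = update_centroids(points, labels, centroids)
    return centroids, labels
```

Under these assumptions, each iteration after the first costs `n_candidates` distance computations per point instead of `k`. Setting `n_candidates = k` recovers the ordinary assignment step, which gives a baseline for comparing MSE.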


Similar resources

Inter Cluster Distance Management Model with Optimal Centroid Estimation for K-Means Clustering Algorithm

Clustering techniques are used to group transactions based on relevancy. Cluster analysis is one of the primary data analysis methods. Clustering can be done in two ways: hierarchical clustering and partition clustering. The hierarchical clustering technique uses the structure and data values. The partition clustering technique uses data similarity factors. Transact...


X-means: Extending K-means with Efficient Estimation of the Number of Clusters

Despite its popularity for general clustering, K-means suffers three major shortcomings; it scales poorly computationally, the number of clusters K has to be supplied by the user, and the search is prone to local minima. We propose solutions for the first two problems, and a partial remedy for the third. Building on prior work for algorithmic acceleration that is not based on approximation, we int...


Identifying the Pattern of People's Behavior in Blood Donation Using the K-Means Algorithm Based on Recency, Frequency, and Blood Value

Introduction: The blood donation rate in developed countries is 18 times higher than in developing countries. It is estimated that if only five percent of Iran's population donated blood, it would be adequate to meet the needs of the community. The aim of this paper is to identify blood donors' loyalty behavior for proper planning to extend and enhance blood donation habits among t...


Scalable and Distributed Clustering via Lightweight Coresets

Coresets are compact representations of data sets such that models trained on a coreset are provably competitive with models trained on the full data set. As such, they have been successfully used to scale up clustering models to massive data sets. While existing approaches generally only allow for multiplicative approximation errors, we propose a novel notion of coresets called lightweight cor...


An Efficient Document Clustering Based on HUBNESS Proportional K-Means Algorithm

Evaluating similarity between documents is a main operation in the text processing field. Similarity measurement is used to estimate the relationship between records or documents. In the existing system, similarity between two documents can be computed with respect to features by using Similarity Measure for Text Processing (SMTP). In the proposed hybrid, the SMTP scheme is integrated with hubness base...



Publication date: 2017